Term Deposits Subscription Prediction

by Michele Casalgrandi

Background

Care Bank is a bank planning to launch a new campaign to add term deposit subscriptions for its customer base.

A prior campaign had a conversion rate of more than 12%.

For the new campaign, Care Bank intends to use predictive models to increase the success ratio and minimize the marketing budget.

Objectives

Data Dictionary

Bank client data:

Related to this campaign:

Related to the previous campaign:

Customer interaction data:

Import libraries and load data

The data set contains 45211 observations with 17 columns each.

There are no null values in the data set.

There are no duplicates in the data set.
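The shape, null and duplicate checks above can be reproduced with a few pandas calls. The following is a minimal sketch: the small in-memory frame stands in for the real data set (its file name is not given here), and `summarize` is a hypothetical helper name.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return basic integrity checks for a data frame."""
    return {
        "shape": df.shape,                             # (rows, columns)
        "null_count": int(df.isnull().sum().sum()),    # total missing cells
        "duplicate_rows": int(df.duplicated().sum()),  # fully repeated rows
    }


# Illustrative stand-in data (assumption; not the real data set).
sample = pd.DataFrame({
    "age": [30, 45, 52],
    "job": ["technician", "management", "retired"],
    "Target": ["no", "no", "yes"],
})

report = summarize(sample)
print(report)
```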

EDA

Univariate analysis

Numerical variables distributions

age is skewed to the right with a number of outliers. The outliers appear to be legitimate values.

balance is highly skewed to the right with a large number of outliers. 3766 customers have a negative balance.

We will transform using the square root.
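Because balance includes negative values, a plain square root is undefined for part of the column. One way to handle this, shown here as an assumption rather than the exact code used in this notebook, is a signed square root that transforms the magnitude and restores the sign. The name `signed_sqrt` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def signed_sqrt(values: pd.Series) -> pd.Series:
    """Square root of the magnitude, with the original sign restored."""
    return np.sign(values) * np.sqrt(np.abs(values))


# Illustrative values only (assumption; not taken from the data set).
balance = pd.Series([-400, 0, 100, 10000], name="balance")
balance_sqrt = signed_sqrt(balance)
print(balance_sqrt.tolist())
```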

The distribution is much improved.

day

day does not appear in the data dictionary provided. It varies between 1 and 31, so it is most likely the day of the month.
The distribution of day does not show a clear pattern.

duration is highly skewed to the right. Some outliers are very high (more than 3000 seconds, i.e. more than 50 minutes).

We will transform with square root.

The distribution is improved, although it is still skewed to the right.

campaign

Most values are less than 7.

campaign is highly skewed to the right, and some values are very high. We will break it down by contact to look for a pattern.

As all values are positive, we will transform with log.
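A log transform is only defined for strictly positive values, so it is worth asserting that condition before applying it. The sketch below uses made-up campaign counts (an assumption, not the real column) and compares skewness before and after.

```python
import numpy as np
import pandas as pd

# Illustrative contact counts (assumption; not taken from the data set).
campaign = pd.Series([1, 1, 2, 3, 7, 63], name="campaign")

# The natural log requires strictly positive inputs.
assert (campaign > 0).all()

campaign_log = np.log(campaign)

# Sample skewness before and after the transform.
print(round(campaign.skew(), 2), round(campaign_log.skew(), 2))
```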

The distribution is improved, but because the majority of values are at the low end of the scale it remains skewed.

previous

previous is skewed to the right, with one very high outlier that is probably a data entry error.

Most entries are zero (no contact with this customer before the current campaign).

We will cap the outlier at the next highest value.
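Capping a single extreme value at the next highest value can be sketched as follows. The helper name `cap_max_to_next` and the sample values are hypothetical; the notebook's own implementation is not shown here.

```python
import pandas as pd


def cap_max_to_next(values: pd.Series) -> pd.Series:
    """Replace the maximum value with the next highest distinct value."""
    distinct = values.drop_duplicates().nlargest(2)
    if len(distinct) < 2:
        return values.copy()  # only one distinct value, nothing to cap against
    top, runner_up = distinct.iloc[0], distinct.iloc[1]
    # Keep every value below the maximum; replace the maximum itself.
    return values.where(values < top, runner_up)


# Illustrative values only (assumption; not taken from the data set).
previous = pd.Series([0, 0, 1, 3, 58, 275], name="previous")
previous_capped = cap_max_to_next(previous)
print(previous_capped.tolist())
```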

previous remains highly skewed. We will transform with square root.

The skew is reduced. However, since the majority of values are zero, a right skew remains.

pdays

The majority of customers were not contacted in the previous campaign and have a value of -1 for pdays. All of them have poutcome = 'unknown'.

We will transform using the square root.

While the range is now much smaller, the high skew remains because most observations have a value of -1.

job has 12 levels, including 288 observations with a value of 'unknown'.

marital has 3 levels; all are appropriate.

education has 4 levels, including 1857 observations with a value of 'unknown'.

default, housing and loan have 2 levels, 'yes' and 'no', as expected.

contact has 3 levels, including 13020 observations with a value of 'unknown'.

month has 12 levels, as expected.

poutcome has 4 levels. The majority of observations have a value of 'unknown' (36959).

Target has 2 levels. The classes are imbalanced: the majority of observations have a value of 'no' (39922 of 45211, 88.3%).

job

job has 12 different levels. The top three levels are 'blue-collar','management' and 'technician' with 22%, 21% and 17% respectively.

marital

Most customers are married (60%), followed by single and divorced.

education

51% of customers have secondary education, followed by tertiary and primary. For 4% of customers the education level is unknown.

default

1.8% of customers have a credit default.

loan

16% of customers have a personal loan.

56% of customers have a housing loan.

contact

Most customers (65%) were reached via 'cellular'. Only 6.4% were contacted via 'telephone'.

A high percentage of observations have the value 'unknown'.

month

30% of customers were last contacted in May.

June (12%), July (15%), August (14%) represent the next highest percentages, suggesting most of the current campaign was run between May and August.

poutcome

Most values of poutcome are 'unknown'.

Target

Target is an imbalanced class. 12% of customers have a term deposit.

Convert object type columns to category
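A minimal sketch of this conversion is shown below. It selects every non-numeric column, which in this data set corresponds to the object-typed columns, and casts them to the pandas `category` dtype. The sample frame is an illustrative stand-in, not the real data.

```python
import pandas as pd

# Illustrative stand-in data (assumption; not the real data set).
df = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["single", "married", "married"],
    "Target": ["no", "no", "yes"],
})

# Non-numeric columns are the text columns to convert.
text_cols = df.select_dtypes(exclude="number").columns
df[text_cols] = df[text_cols].astype("category")

print(df.dtypes)
```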

Bivariate analysis

Pair plot observations

Customers with a term deposit tend to have a smaller range of campaign values, with fewer outliers.

duration is a good predictor of Target.

However, we will not have a value for duration at prediction time, as this will be known only after the customer has been contacted for the next campaign.

We will drop duration.

Although some days have higher term deposit rates than others, there is no clear pattern.

Correlations

Data after transformation

Correlation observations

Change categorical variables to category dtype

We will leave day as numeric to avoid creating one dummy column per day.

Job vs Target

Customers with job of 'retired' or 'student' tend to have higher rates of term deposits.

Customers with job of 'management' have the highest absolute number of term deposits.
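The difference between the highest rate and the highest absolute number can be made explicit with `pd.crosstab`, once with raw counts and once normalized by row. This is a sketch on made-up rows (an assumption), not the real data.

```python
import pandas as pd

# Illustrative rows only (assumption; not taken from the data set).
df = pd.DataFrame({
    "job": ["retired", "retired"] + ["management"] * 6,
    "Target": ["yes", "no", "yes", "yes", "no", "no", "no", "no"],
})

# Absolute number of customers per job and outcome.
counts = pd.crosstab(df["job"], df["Target"])

# Share of each outcome within a job (each row sums to 1).
rates = pd.crosstab(df["job"], df["Target"], normalize="index")

print(counts)
print(rates.round(3))
```

In this toy example 'management' has more deposits in absolute terms while 'retired' has the higher rate, which is the same kind of contrast described above.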

marital vs Target

Customers who are single have the highest rate of term deposits, followed by divorced and then married customers.

Customers with tertiary education have higher rates of term deposit than others.

default vs Target

Customers with no credit default have higher rates of term deposits.

housing vs Target

Customers without a housing loan have higher rates of term deposits.

Customers without a personal loan have higher rates of term deposit.

contact vs Target

Customers with contact of 'cellular' and 'telephone' have similar rates of term deposits. 'unknown' has the lowest rate.

month vs Target

Customers who have last contact month of 'dec', 'mar', 'oct' and 'sep' have significantly higher rates of term deposits.

poutcome vs Target

Customers with poutcome of 'success' have a significantly higher rate of term deposit.

EDA Summary

Bivariate EDA Summary

Data preparation prior to model building

Split into train and test
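With an imbalanced target, a stratified split keeps the share of 'yes' outcomes the same in both partitions. The sketch below one-hot encodes the categorical columns and splits with scikit-learn; the 25% test size, the random seed and the sample frame are assumptions, since the notebook's actual settings are not shown here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative stand-in data with 20% positives (assumption).
df = pd.DataFrame({
    "age": list(range(20, 40)),
    "marital": ["single", "married"] * 10,
    "Target": ["yes"] * 4 + ["no"] * 16,
})

# One-hot encode categorical predictors; drop_first avoids a redundant column.
X = pd.get_dummies(df.drop(columns="Target"), drop_first=True)
# Encode the target as 1 for 'yes' and 0 for 'no'.
y = (df["Target"] == "yes").astype(int)

# stratify=y preserves the class ratio in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```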

Model Building

Model performance assessment

The model can make two types of wrong predictions:

  1. False positive: predicting a customer will create a term deposit but the customer doesn't. Loss of marketing resources.
  2. False negative: predicting a customer will not create a term deposit but the customer does. Opportunity loss.

As per the objective, the main goal of the bank is to have customers add term deposits with minimal expenditure of resources. The bank wants to minimize false positives, i.e. maximize precision. The higher the precision, the lower the number of customer contacts that do not result in a term deposit.
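To make the trade-off concrete, precision and recall can be computed directly from the counts of true positives, false positives and false negatives. The counts below are invented for illustration and are not results from this notebook.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted subscribers who actually subscribed."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of actual subscribers that the model identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical confusion-matrix counts (assumption).
tp, fp, fn = 30, 10, 120

p = precision(tp, fp)  # few wasted contacts
r = recall(tp, fn)     # many missed subscribers
print(p, r)
```

A model like this wastes few contacts (high precision) but misses most potential subscribers (low recall), which is the pattern several of the models below exhibit.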

Bagging classifier

The bagging model generalizes well, but precision is very low at 0.279.

We will tune the classifier to see if we can improve performance.
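One common way to tune a bagging classifier is a grid search scored on precision. The sketch below wraps a logistic regression base estimator and runs on synthetic data; the parameter grid, fold count and data are assumptions and not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data as a stand-in (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=1
)

# Bagging ensemble of logistic regressions (base estimator passed positionally).
bagging = BaggingClassifier(LogisticRegression(max_iter=1000), random_state=1)

# Hypothetical grid; real searches usually cover more values.
param_grid = {"n_estimators": [5, 10], "max_samples": [0.5, 1.0]}

# scoring="precision" selects the combination with the best mean precision.
search = GridSearchCV(bagging, param_grid, scoring="precision", cv=3)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```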

There is a slight decrease in precision, but it is not significant.

There is still a large number of false positives.

Build Random Forest classifier

As expected, the model is overfit, with a precision of 1.0 for train and 0.62 for test.

However, its performance is higher than that of the bagging classifier with logistic regression.

Recall is very low at 0.195 for test.
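The overfitting check above amounts to scoring the same fitted model on both partitions and comparing the results. A minimal sketch on synthetic data (an assumption; the numbers it prints are unrelated to the results reported here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data as a stand-in (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)

forest = RandomForestClassifier(random_state=1).fit(X_train, y_train)

# Score the same model on train and test; a large gap signals overfitting.
scores = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = forest.predict(X_part)
    scores[name] = {
        "precision": precision_score(y_part, pred, zero_division=0),
        "recall": recall_score(y_part, pred, zero_division=0),
    }

print(scores)
```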

Random Forest performance tuning

We will tune the model to see if we can get better performance.

Test precision is low at 0.332 (down from 0.62 for the untuned model), but the model is no longer overfit.

Recall has risen to 0.597 (from 0.195).

Build Decision Tree classifier

The model is overfit. Performance on test data is very low, with a precision of 0.292.

We will perform cost complexity tuning to see if we can improve performance.

Decision tree cost complexity pruning

Build decision tree models for all the alphas

We will remove the last classifier as it is a trivial tree with only one node.
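The pruning steps can be sketched with scikit-learn's `cost_complexity_pruning_path`, which returns the effective alphas at which nodes are removed from the fully grown tree. The synthetic data and variable names are assumptions; the largest alpha corresponds to the tree pruned down to its root, which is why it is dropped.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; not the bank data set).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Effective alphas at which nodes are pruned from the fully grown tree.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
# Clipping guards against tiny negative values from floating-point error.
ccp_alphas = np.clip(path.ccp_alphas, 0, None)

# Fit one tree per alpha; larger alphas prune more aggressively.
all_trees = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X, y)
    for alpha in ccp_alphas
]
node_counts = [tree.tree_.node_count for tree in all_trees]

# Drop the largest alpha, which yields the trivial root-only tree.
trees, alphas = all_trees[:-1], ccp_alphas[:-1]

print(len(trees), node_counts[0], node_counts[-1])
```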

Plot precision score for the various alpha values

Although the best model has a higher precision (0.629), this comes at the expense of recall, which is now extremely low (0.178).

There is a very high number of false negatives, which represent lost opportunities for the bank.

We will try further tuning.

Decision tree hyperparameters tuning

Performance is similar to the cost complexity pruned tree.

There is a high number of false negatives (1236) and still a significant number of false positives.

Boosting models

Adaboost

The model is not overfit, but performance is low. While precision is 0.627, that comes at the cost of a low recall of 0.195.

We will tune the model hyperparameters.

Precision on the test data is higher at 0.739. However, there is a huge number of false negatives (recall=0.011).

Fit gradient boost model

The gradient boost classifier returns a precision score of 0.644 on test data, but it is somewhat overfit, with a training precision of 0.704.

Recall is low at 0.205.

We will tune the hyperparameters.

Performance is very similar to that of the previous model.

Train an XGBoost model

The model overfits the data, with a precision of 0.893 for train and 0.588 for test.

We will tune the hyperparameters for XGBoost to see if we can get better performance.

The model's precision has improved (0.729 for test data), but at the expense of recall (0.149).

Stacking Model

We will now train a stacking classifier using the best models built so far.
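A stacking classifier trains a final estimator on the cross-validated predictions of several base models. The sketch below is illustrative only: the choice of base models, the logistic regression final estimator and the synthetic data are assumptions, since the exact configuration used in this notebook is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data as a stand-in (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)

# Hypothetical base models; each contributes predictions to the final estimator.
estimators = [
    ("ada", AdaBoostClassifier(n_estimators=25, random_state=1)),
    ("gb", GradientBoostingClassifier(n_estimators=25, random_state=1)),
]
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack.fit(X_train, y_train)

test_precision = precision_score(y_test, stack.predict(X_test), zero_division=0)
print(round(test_precision, 3))
```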

Performance of the stacking classifier is slightly lower than that of XGBoost, with a precision of 0.7 and a recall of 0.168 on test data.

Compare model performance

We will choose the tuned adaboost model for our predictions.

Feature importance

Although we have chosen the tuned AdaBoost model for our predictions, it doesn't have a way to determine feature importance.

We will use the next best model, the tuned xgboost, to assess the importance of the features.
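Tree-based boosting models expose per-feature importance scores through a `feature_importances_` attribute after fitting, and xgboost's scikit-learn wrapper provides an attribute of the same name. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for the tuned XGBoost model, on synthetic data with hypothetical feature names, so its output says nothing about the real features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data with placeholder feature names (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=1
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Stand-in model; the notebook uses a tuned XGBoost classifier instead.
model = GradientBoostingClassifier(n_estimators=50, random_state=1).fit(X, y)

# Importances are normalized to sum to 1; sort to rank the features.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances.round(3))
```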

Business Recommendations